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Abstract: 

A descent algorithm, "Quasi-Quadratic Minimization with Memory" 
(QQMM), is proposed for unconstrained minimization of the sum, F, of 
a non-negative convex function, V, and a quadratic form. Such problems 
come up in regularized estimation in machine learning and statistics. In 
addition to values of _F, QQMM requires the (sub)gradient of V. 

Two features of QQMM help keep low the number of evaluations of the 
objective function it needs. First, QQMM provides good control over stop- 
ping the iterative search. This feature makes QQMM well adapted to sta- 
tistical problems because in such problems the objective function is based 
on random data and therefore stopping early is sensible. Secondly, QQMM 
uses a complex method for determining trial minimizers of F. 

After a description of the problem and algorithm a simulation study 
comparing QQMM to the popular BFGS optimization algorithm is de- 
scribed. The simulation study and other experiments suggest that QQMM 
is generally substantially faster than BFGS in the problem domain for which 
it was designed. A QQMM-BFGS hybrid is also generally substantially 
faster than BFGS but does better than QQMM when QQMM is very slow. 
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1. Introduction 
1.1. Motivation 



In order to make progress in solving a real world prediction problem of inter- 
est to the Department of Psychiatry at Columbia University, I have been led 
to develop a nonparameteric hazard function estimator for survival analysis. 
(This estimator will be described in another paper.) Fitting the nonparamet- 
ric models required high dimensional numerical optimization for which I used 
the well regarded, but rather generic, BFGS optimizer. (See subsection 11.21 for 
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references.) I found that with BFGS the model fitting required hours. (Once it 
took f9 hours.) 

Such long optimization times severely hampered development of the estimator 
so I cast about for an alternative to BFGS. My optimization problem is convex 
and convex optimization is an intensively studied problem (subsection ll.2p . I 
was quite surprised, therefore, when I could not find an off-the-shelf, generic, 
unconstrained convex optimizer suitable to my problem. (More about this in 
subsection 11.21 ) Consequently, I developed an apparently new, non-generic, op- 
timizer, QQMM, that is the subject of this paper. I found that hazard estimation 
using QQMM generally took less than one hour, often substantially less. (See 
section m for an example.) 

QQMM is designed for a specialized optimization problem class. However, 
the problem domain for which QQMM is appropriate is sufhciently broad that 
others might find QQMM useful. 

1.2. Background 

A general strategy for nonparametric function estimation problems is as follows 
(ii El Sill). 



1. Determine empirical risk functional. (Regard a minus log likelihood func- 
tional as a kind of risk functional.) 

• E.g., mean squared residual or misclassification rate. 

2. To make optimization easier, if the risk functional is not already convex 
bound it above by a convex functional, V. Use V in place of risk. 

• As y is reduced, the original risk functional will tend to go down. 

3. Only consider estimates in some "Reproducing Kernel Hilbert Space" 
(RKHS; SlV 

4. Regularize (jlSl . p. 34) by adding a penalty that is a multiple of the squared 
norm in the RKHS. 

5. Minimize the sum: V + penalty. 

This strategy is advocated, at least for classification problems, in the following 
quote from ([sl, p. 337). "In our view, what is special about SVMs [Support Vector 
Machines] is the combination of the following ingredients: first and foremost, the 
use of positive definite kernels [item [3] in the preceding list] ; then regularization 
via the norm in the associated reproducing kernel Hilbert space [item |4] ; finally 
the use of a convex loss function [item [2] ..." 

Thus, the idea is to minimize, in h G Hk, where Hk is an RKHS, a functional 
of the form. 

Fih; Y) := Fy{h) := Vih; Y) + X\\h\\l. (1) 

Here, Y is the data, F is a convex functional that is larger than the empirical 
risk or minus log likelihood, A > is constant (the "complexity parameter"), 
and |[ • \\k is the norm of Hk- F is the "objective function" (or "objective", for 
short). Call the expression A|[/i|||. the "penalty term." (See subsubsection 13.1.^ 
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for an example.) There may also be side constraints on h, but in this paper 
we consider minimizing the functional ^ with no side constraints. We wish to 
find an approximation to a vector h (a "minimizer") at which F achieves its 
minimum value. 

Thus, this paper concerns a class of convex optimization problems. Convex 
optimization is an important research area (e.g. j3;[3;[21; 16; i3))- However, it 
is difficult for most statisticians to tap into the products of this research area. 
To quote yj), 

. . . there remains a significant impediment to the more widespread adoption of 
convex programming: the high level of expertise required to use it. With mature 
technologies such as [least squares] , [linear programming] , and [convex quadratic 
programming], problems can be specified and solved with relatively little eff'ort, 
and with at most a very basic understanding of the computations involved. This 
is not the case with general convex programming. That a user must understand 
the basics of convex analysis is both reasonable and unavoidable; but in fact, a 
much deeper understanding is required. Furthermore, a user must find a way to 
transform his problem into one of the many limited standard forms; or, failing 
that, develop a custom solver. For potential users whose focus is the application, 
these requirements can form a formidable "expertise barrier" especially if it is 
not yet certain that the outcome will be any better than with other methods. 

Grant et al dlJ) have developed, in JVIATLAB (The Mathworks, Natick, JVIA), 
a general purpose convex optimization modeling environment called "CVX" 
(http://www.stanford.edu/~boyd/cvx/). However, "CVX is not designed to 
handle problems such as [the survival analysis problem that prompted the de- 
velopment of QQMM]." -JVI.C. Grant, personal communication.) 

As further evidence for the assertion of (flih note that apparently JVIATLAB 's 
optimization toolbox does not come with functions for general convex optimiza- 



tion. JVIoreover there does not appear to exist a designated R (|23l ) package for 
convex optimization. 

QQMM is an easy to use "custom solver", in Grant et aVs terminology, for a 
large class of statistical problems. In addition to describing QQMM, I compare 
it to a popular "other method", namely BFGS (0; lIlJlS 13, H, section 9.2), 



([22, Section 6.1). BFGS is a "quasi-Newton" method (122). Thus, BFGS requires 
only function values and gradients and with each iteration updates a matrix that 
it uses as one would use the inverse Hessian matrix in Newton's method. (See 
section 131) 

Moreover, an important assumption we make is that V is bounded below by 
a known constant. Without loss of generality, we may take that constant to be 0. 
I.e., assume V is non-negative. If V is not bounded below by a known constant, 
replace V by the point-wise maximum max{y — c, 0}, for some constant, c. By 
taking c sufficiently far below 0, this substitution may have little effect on the 
minimum of the objective. 

In this paper I describe an apparently new algorithm "Quasi-Quadratic Min- 
imization with Memory" (QQMM) for minimizing functions like F. QQMM was 
developed for a survival analysis application of the recipe presented in steps 1-5 
above. (This application will be described in a future paper.) In the application 
to survival analysis the functional V is time consuming to compute. QQMM 
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may be best suited to functions like the survival analysis objective where V is 
expensive to compute. (But see subsection 13.31 ) 

QQMM enjoys a good level of generality: It applies to a broad class of prob- 
lems but does not suffer from the learning curve and inefficiency that come with 
greater generality. 

Because of the kernel property (0, p. 146) minimizing (H]), which prima facie 
is often an infinite dimensional problem, often reduces to a, perhaps large, finite 
dimensional unconstrained problem. Call the dimension of the reduced problem 
d. (See subsubsection 13 . 1 !3l for an example.) 

Kernel-based methods ([T]) provide a motivation for the optimization problem 
we are considering. But a simpler way to describe the problem is as follows. 



Start with a known symmetric, positive definite quadratic form Q ([251) on a 
d-dimensional real linear space, H. Thus, for a given basis on H, there is a 
dx d symmetric, positive definite matrix, Q, such that if h G H has coordinates 
X := (cci, . . . ,Xd) relative to the basis then Q{h) — xQx-^, where indicates 
matrix transposition. Consider minimizing, in h, the function 

F{h) Vih) + Q{h), h e iJ, (2) 

for some non-negative convex function, V , on H. 

QQMM is an algorithm for minimizing QQMM requires the inverse of a 
matrix, Q, defining Q so Q has to be strictly positive definite. (See subsection 
for an explanation and discussion.) 
Remarks: 

1. The quadratic part does not have to come from the penalt y te rm. E.g., 
one should be able to use QQMM for lasso regression (|2^. (|l5l . p. 64) in 
which V provides the quadratic part. 

2. The requirement that V be convex can be relaxed. If V is the difference 
between two convex functions then the minimization of F can be reduced 
to iterative application of QQMM to a sequence of convex y's (0, p. 322), 
(0). 



1.3. QQMM 

The left panel of figure [T] shows QQMM in action on a convex function of two 
variables. Numbers 1 through 9 label the positions at which and the order in 
which the function was evaluated during the iterative search for the minimum. 
(Positions 7, 8, and 9 are crowded together in the middle.) Twice, in moving 
from position 4 to position 5 and in moving from position 7 to position 8, 
QQMM overshot and actually increased the objective function. In those cases 
the algorithm backtracked. (The right hand panel of figure [T] will be explained 
in subsection 12.41 ) 

QQMM requires the functions V and Q and a function, g, that returns sub- 
gradients of V (subsection I2.H subsubsection 12.2. ip . 

Fy in ^ depends on F, a data sample. If Yi and I2 are two different sam- 
ples, the location of the minimum of -FVi will probably not be the location of 
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Fig 1. QQMM search path with a random convex function in two variables. Left panel shows 
contour plot of objective function and the 9 trial minimizers numbered in order of evaluation 
and connected by lines. Right panel shows function values at the trial minimizers (solid curve) 
and lower bounds (dashed). Horizontal axis is evaluation number. (Vertical axis is on square 
root scale.) 
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the minimum of Fy-j . For that reason there is no need to approximate the min- 
imizer of F very precisely. For model selection or even more for resampling (Q), 
computational speed is an issue. An advantage of QQMM is that as it searches 
for a minimum it computes ever more accurate lower bounds on F. One can 
use this lower bound in a stopping rule for the search. For example, one could 
specify that the search stop when the objective and lower bound differ by less 
than 1% (subsection inUl). 

The objective function for the survival analysis algorithm for which QQMM 
was developed is expensive to compute. Therefore, an especially important goal 
in developing QQMM was that it converge without many function evaluations. 
This is accomplished through two means. First is the aforementioned ability to 
stop early. Another way that QQMM is designed to require few evaluations is 
that it performs a fair amount of computation to carefully determine trial min- 
imizers (subsections 12.21 and 12. 3[) . This is cost effective because, as mentioned 
above, QQMM was devised to minimize F for a specific form of V that is ex- 
pensive to evaluate. This extra computational effort may be counter productive 
if computing V is cheap. (But see subsection 13.31 ) 

The cumulative amount of computational overhead for QQMM is approx- 
imately proportional to the square of the number of function evaluations and 
memory requirements are roughly linear in the number of evaluations. If QQMM 
needs many function evaluations and d is large this burden can be important. 

The distribution of the number of objective function evaluations needed by 
QQMM seems to have a long right hand tail. E.g., if the quadratic form Q has 
small eigenvalues, say, if A in ((T|) is small, then QQMM can be quite slow. A way 
to correct that is "BFGS rescue" (subsection 12. 5p in which the BFGS algorithm 
takes over if QQMM runs too long. However, "most of the time" QQMM seems 
not to need rescue (section [3l) . 

QQMM involves matrix operations so that in practice there is a limit to how 
large d can be. I have successfully used QQMM with d as large as 5,000. 

I have implemented QQMM in R. (The source code is available from me.) 
The algorithm is presented in section [2l In section [3] QQMM is compared to 
BFGS in kernel-based, nonparametric L"^/^-regression with thousands of simu- 
lated datasets. A summary and further discussion are presented in section |4l 
(Part of this paper appeared as (flo[ ).) 

Since QQMM is rather complex, I make no attempt here to analyze it theo- 
retically. Moreover, the notion of "stopping early" appears to be a rather non- 
traditional notion in the theoretical analysis of optimization algorithms. 



2. The Algorithm 

QQMM proceeds interatively beginning with a starting vector, h startVec. 
The algorithm generates a sequence of "trial minimizcrs" , ft,(2) i ■ • ■ G H 
until some criteria are met (subsection l2.4|) . Then it halts. QQMM is a descent 
method (jlSl . p. 427) so that after each iteration the value of F is strictly reduced. 
In fact, after each iteration F is "sufficiently" decreased (subsection 12. 3p . Not 
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all trial minimizers will decrease the value of F sufficiently. The trial minimizers 
that do decrease the value of F sufficiently are "iterates" . (The starting value of 
h is, trivially, considered to be an iterate.) Thus, an iteration is complete when 
an iterate is found, but F may have to be evaluated at several trial minimizers in 
order to find an iterate. When QQMM halts it returns (among other things) the 
last iterate (unless it halts because the evaluation count limit has been reached 
in which case it returns the best h it has found, iterate or not). 



2.1. Top level description 

Here, I describe, somewhat abstractly, the general structure of QQMM. Let (•, •) 
be an inner product on H. If ho G H, let g{ho) e _ff be a "subgradient" of V at 



ho (|2ll . p. 126). This means the following. Let phg be the affine function 

Ph„{h) ■.^V{ho) + {h-ho,9{ho)) (heH). (3) 

Then phg < V pointwise. This is only possible in general because V is convex. 
Typically, the actual gradient W{ho) at ho will exist (flil Theorem 4.2.3, p. 
189). If so, 5(/io)=VF(/io). 

Let h £ H. There is an operation that takes the triple (^h, F{h), g{h)^ to a 
global underestimator (3, p. 69) of F, i.e., a function that is nowhere larger than 
F fsubsubsection [2".2.ip . Call that global underestimator minor ant (h) . As the 
algorithm proceeds it in effect accumulates the minorants in a list. Call that 
list ^^minorants" . It is because QQMM accumulates information as it proceeds 
that the second "M" , for "Memory" , is included in the name of the algorithm. 

Let "M" = "reals" . Recall that V is non- negative. Let startVec G H. startVec 
will be a vector of starting values for the algorithm. In the following hi will 
denote the current iterate. 
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minorants ^ empty set; LargstLwrBndSoFar ^ £ M; 
eval .count ^ 0] hi ^ startVec] 
While eval .count < max.evals; 

If criteria for stopping are met then break out of loop; 
If F{hi) and g{hi) have not already been evaluated, 
then evaluate F{hi) and g{hi) and 

compute minorant(/ii); 
eval .count <— eval .count + 1; 
Append minorant{hi) to minorants] 
Use minorants and LargstLwrBndSoFar to 
find next trial minimizer, /12, 
and update LargstLwrBndSoFar] 
Evaluate F and g at /12 and append minorant{h2) to minorants] 
eval. count ^ eval. count + 1; 

While F{h2) is not suflSciently decreased from F{hi) 
and eval. count < max.evals + 1; 

Compute new h2 on the line {t{h2 ~ hi) + hi : < G R}; 

Compute F{h2) and g(h2)] 

eval .count <— eval .count + 1; 
End while; 

Append minorant{h2) to minorants] 
(*) If F(/i2) < then /ii ^ /i2; 

End while; 
Return hi] 

Remarks: 

1. minorants does not have to be initialized to the empty set. It could be 
initialized to a list generated by a previous run of the algorithm, for ex- 
ample. 

2. If F{h2) is sufficiently decreased from F{hi) then, of course, F(/i2) < 
F{hi). It is necessary to check F(/i2) < F{hi) at the step labeled (*) 
because the algorithm might break out of the sufficient decrease loop due 
to the evaluation limit having been reached. 

To complete the description of the algorithm I must spell out, first, what 
the minorants are and how they are used to generate the next trial search 
vector (and how LargstLwrBndSoFar is updated), second, what is meant by 
"sufficient decrease" and how a new h2 is chosen on the line through hi and ft,2, 
and, third, when to stop. These are described in subsections 12.21 12. 3| and 12. 4|, 
respectively. In subsection 12.51 1 describe a hybrid of QQMM and BFGS. 

2.2. Minorants 

In this subsection I give details on step "find next trial minimizer" in the 
algorithm outline in subsection 12. II The naive idea is to use at each iteration all 
the information about F accumulated so far to bound F below by another convex 
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function, J-. Then one chooses the niinimizer of J-' to be the next trial minimizer 
of F. However, that idea is impracticaL Instead, QQMM minimizes multiple 
convex functions bounding F below and chooses the one whose minimum value 
is largest. 

Suppose we have accumulated a list of trial minimizers. Let hi be the current 
iterate. Let LargstLwrBndSoFar be the current value of the lower bound to 
F. If the stopping criteria (subsection 12. 4p are not satisfied, QQMM needs to 
search for /12 £ at which F is "sufficiently decreased" compared to F{hi) 
fsubsection l2.3p . (Once such an /12 is found, QQMM sets hi <— /12 and, providing 
the evaluation limit, max.evals, has not been reached, repeats the process.) 
Iteration should, of course, stop if F{hi) — LargstLwrBndSoFar. So assume 
F{hi) > LargstLwrBndSoFar. 

2.2.1. Computing minorants 

Recall that V is non- negative, let ho G H, and recall the definition, ([3]), of phg. 
Then, again because V is convex, the function 

qhoih) ■■= q{h;hQ,V{ho),g{ho)) -.^ inax{pho{h),0} + Q{h), heH, 

is a convex global underestimator of F. (The function qhg is what is called 
minorant{h()) in subsection 12. II ) Call qho a "quasi-quadratic" function. (Figure 
[2] shows examples. The "QQ" in "QQMM" stands for "quasi-quadratic".) It is 
easy to minimize qh„ , and the minimum value is a lower bound on F that is at 
least 0. 

2.2.2. Using minorants to compute a trial minimizer 

Suppose the current list of minorants is minorants = {<z/i^), . . . ,'?h(„_i)}- Let 
/i(„) := hi be the last iterate. QQMM appends qh^„J to minorants. (More 
precisely, QQMM maintains lists of trial minimizers and corresponding values 
and subgradients of V. "Appending h(^n) to minorants" means appending ft.(„), 
V'(/i(„)), and the subgradient g(/i(„)) to the appropriate hsts. I.e., QQMM stores 
the "raw bundle", ijlTI . p. 228).) The (pointwise) least upper bound 

Tr, max{g,,,^j , qt^^^ qin^^r) : } 

is a convex global underestimator of F. (JF„ is a "model" of the convex func- 
tion F, (21", p. 157), m, Section 9.2).) In theory we could find the next trial 
minimizer by minimizing 

However, if n is large then minimizing J-"„ could be harder than minimizing 
F. Instead, QQMM proceeds as follows. For i = 1, 2, . . . , n — 1 let qh^i^M be the 
pointwise maximum 

qh^iy.hiW ■^^^^^{Qh^,^{h),qhl{h)} = max{(7,i(^, (/i),(7ft(^j(/i)}, he H. 
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Fig 2. Two quasi- quadratic functions on the line. A quasi- quadratic function is a "hinqe" plus 
a quadratic. In general, hinges do not fold at the origin. (In higher dimensions hinges fold 
along planes that in general do not pass through the origin.) Where the hinge folds influences 
the shape of a quasi- quadratic function, as shown here. 

This function is a convex global underestimator to F that is easy to minimize 
in closed form. (In the current implementation minimizing qh,hi requires the 
inverse of a matrix, Q, defining Q. If Q is ill-conditioned one can improve the 
conditioning by replacing Q by the matrix obtained by building up the diagonal 
of the matrix. This is illustrated in subsubsection [3TL31 ) 

Let i = j he the index in {1,2, ... ,n — 1} such that the minimum value 
of the function qhfifMi{h) is the largest, i.e., no smaller than the minimum 
value of any qiniy.hi {i — — 1). (So h^-^ has a "maximin" property.) 

Let /i™™ be the vector at which q}nj^,hi achieves its minimum. Denote the 
minimum value, qh^j-,,hi{h"^"^), by m. Let LargstLwrBndSoFar be the largest 
(constant) lower bound to F that QQMM has found so far. (Do not confuse 
LargstLwrBndSoFar, a lower bound on F that changes during the course of 
the calculation, with 0, a fixed lower bound on V.)lim > LargstLwrBndSoFar 
then the next trial minimizcr will be based upon /i2 :— ft,™™ and QQMM sets 

LargstLwrBndSoFar ^ m. 

However, if m < LargstLwrBndSoFar then QQMM performs a "shoreline" 
operation to arrive at /i2. (Think of LargstLwrBndSoFar as "sea level". See 
subsubsection 12.2.31 ) 

In general, QQMM does not use h2 as a trial minimizcr /i2. It sets 

/i2 ^ hi + daring x (/i2 ~ hi). (4) 
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In the numerical experiments described in this paper daring is just the con- 
stant 0.7 f subsubsect ion [^. 2 . 1 p . however, one might let daring be, say, a slowly 
decreasing function of the iteration number. 

Before replacing hi by /12 defined by ([4]) QQMM tests to see whether F{h2) 
represents a "sufficient decrease" from F{hi). This process is described in sub- 
section [273l If F(/i2) does not represent a "sufficient decrease" from F{hi), then 
/i2 is replaced by hi + 5{h2 — hi) for some user-specified constant 5 G (0, 1). For 
the calculations made for this paper, S := 1/6 (subsubsection [7. 2 . 1 p . 

2.2.3. "Shoreline operation" 

This means the following. Suppose m < LargstLwrBndSoFar. Let ft-i, be the 
vector at which is minimized. Then by definitions 

< LargstLwrBndSoFar < F(hi) = qhAhi). 
Hence, since qti is convex there exists a unique t G (0, 1) at which 

qhi — hi) + hi\ — LargstLwrBndSoFar. 

QQMM sets 1.2 := - hi) + hi. 

2.3. Sufficient decrease 

If a trial minimizer, /12, were accepted as an iterate if it merely reduced the 
size of the objective compared to the current best value, F{hi), the objective 
could conceivably decrease unacceptably slowly from one iteration to the next. 
To help prevent this, QQMM requires a "sufficient decrease" (0, p. 33) in F. 

Figure [3] shows how the sufficient decrease test works. The solid curves are 
graphs of F along the line passing through hi and /i2- The dotted horizontal lines 
are at a height equal to F{hi). The values /i2a through /12/ are possible positions 
of /i2 along the line. The sloping dashed lines pass through {hi, F{hi)) and have 
slopes proportional to the slopes of F at hi. (The constant of proportionality 
- call it (T - is a tuning constant of the algorithm; see subsubsection 13. 2. II For 
the calculations presented in this paper I used 1/6, but in the figure a ^ 1/6.) 
If F{h2) > F{hi) as at /12 = h2d, then there is no decrease, so certainly not a 
sufficient decrease. We have F{h2c) < P{hi) but the decrease in F at /12 = /i2c is 
not deemed sufficient because (1) the point (/i2c, F{h2c)) hes above the dashed 
line and (2) at /12 — /i2c the function F is increasing with increasing distance 
from hi. The point {h2f,F{h2f)) also lies below the dotted and above the 
dashed line but there the function F is decreasing with increasing distance 
from hi. Under those conditions the decrease in F at /12 = ^2/ is defined to 
be sufficient. At /12 = ^20, ^26, and /i2e, the point (/i2,F(/i2)) lies below the 
dashed line and so the decrease is defined to be sufficient at those points. So 
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(a) 



h2a 




(b) 



Fig 3. Sufficient decrease. Curves are graphs of F along the line passing through hi and 
h2- h2a through h2f are possible positions of h2 along the line. Sloping dashed lines pass 
through (^hi, . Horizontal dotted lines also pass through (hi, F(hi)^ . F shows sufficient 

decrease at h2a, h2b, h2e, and h2f, but not at h2c or h2d- 
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even though F(/i2c) < F{h2a), F exhibits sufficient decrease at /i2 = /i2a but 
not at h2 = /i2c- 

If F does not exhibit sufRcient decrease at /i2 then /i2 is replaced by 

h2 ^ hi + 5{h2 — hi). 

Providing eval. count < max.evals + 1, a new test for sufRcient decrease is then 
performed. In the computations done for this paper, I took S := 1/6 (subsub- 
section 13.2. ip . 

2.4- Stopping rules 

An important feature of QQMM is that it allows convenient stopping rules. That 
is because in the course of its operation, it produces a nondecreasing sequence 
of constant lower bounds, LargstLwrBndSoFar, to F. So as the computation 
proceeds one can, for example, bound the relative error by known quantities: 

F{hi) — LargstLwrBndSoFar ^ F{hi) — mini^ 
LargstLwrBndSoFar ~ min F 

Therefore, a natural stopping rule is to stop when 

F[hi) — LargstLwrBndSoFar ^ , , 

LargstLwrBndSoFar ' 

where e > is a prespecified constant. Then one knows that the relative error 
in the output is no greater than e. (As a precaution, one should also stop when 
g{hi), where g{hi) is a subgradient of F at hi, is very close to 0, when F{hi) is 
very close to 0, and when hi is very close to one of the vectors . . . , /i(„_i).) 
The right panel of figure [T] shows an example. In this paper I use e := 0.01. The 
horizontal axis is evaluation number (of F). The solid curve is the value of F 
produced by the corresponding evaluation. The dashed line shows, at evaluation 
i, the corresponding value of LargstLwrBndSoFar. 

Note that, obviously, when stopping rule ([6]) halts QQMM the real relative 
error will be strictly less than e. But on top of that, in practice the inequality 
([5]) will be strict. This leads one to suspect that in practice the true relative 
error at the final location of hi will tend to be considerably less than e. Panel 
(c) of figure O shows true relative errors from the simulation study presented in 
section [21 In that study e :— 0.01. The actual relative errors tended to be far 
smaller than that. 

As indicated in subsection 12.11 another way the algorithm halts is if the num- 
ber of function evaluations exceeds the limit max.evals. (There might be another 
evaluation or two before QQMM comes to a full stop.) 

Having good control over when the iteration stops is particularly important 
in statistical applications where the objective function F depends on random 
data. Since F is random it does not make sense to strive for high precision in 
approximating its minimizer. 
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2.5. "BFGS rescue" 

QQMM usually closes in on the minimum of the objective function rapidly 
but is sometimes very slow to converge. I.e., the distribution of the number of 
evaluations until convergence has a long right hand tail. To address this problem, 
one can employ a "BFGS rescue". This means the following procedure. First, 
run QQMM for a limited number of evaluations. // QQMM does not converge, 
then perform a limited number of iterations of BFGS from where QQMM left 
off. Denote this procedure by "QQMMj-escue" ■ See subsection 13.21 for a specific 
example. 

3. Numerical Example: Nonparametric L^/^-regression 
3. 1 . Simulation 

3.1.1. Strategy 

I carried out a simulation study of QQMM and BFGS. This means comparing 
QQMM and BFGS on many different problems. However, QQMM is designed 
for "expensive" F's, i.e., V^s that are time consuming to compute. Many runs of 
QQMM an BFGS on problems with expensive F's would take prohibitively long. 
(But I report anecdotally on a BFGS/QQMM comparison on an expensive V 
problem in section 21) As a way out of this quandary just note that, obviously, 
the reason a QQMM or BFGS application to a problem with an expensive 
V might take a long time is that it takes time to evaluate V. I.e., number of 
function evaluations, always a proxy for optimization time, is an especially good 
proxy when V is expensive to compute. This means we might get insight into 
the running times of QQMM and BFGS on problems with expensive V by doing 
a large simulation study for QQMM and BFGS on problems with inexpensive 
V's and taking, instead of running time, number of evaluations as the outcome 
measure of interest. This is the strategy I adopted. 

The mean number of evaluations is related to total amount of time needed 
for optimizations to finish. So the mean number of evaluations required by 
optimizer (and not, e.g., the median) is the crucial parameter. (It turns out 
that for comparison with BFGS, the median is, if anything, a more favorable 
parameter for QQMM than is the mean. But the mean is the one of more 
practical relevance.) 

The current implementation of QQMM and optim's implementation of BFGS 
count "evaluations" differently. For QQMM an evaluation means computing the 
value of F ("value evaluation") together with the subgradient, g, ("gradient 
evaluation"). Each time this happens eval. count is incremented by 1. BFGS 
counts value evaluations and gradient evaluations separately. By "total number 
of evaluations" by BFGS I will mean the number of value evaluations plus the 
number of gradient evaluations. The "total number of evaluations" by QQMM 
will mean a comparable number (essentially twice the final value of eval .count) . 
(In subsection 13.31 1 briefly discuss system time comparisons.) 
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True regression function and simuiated data True reg. fn. + estimate, compiex. param.= 0.1 True reg. fn. + estimate, complex. param.= 0.5 



o.a 1.0 0.0 0.2 



True reg. fn. + estimate, complex. param.= 1 True reg. fn. + estimate, compiex. param.= 2 True reg. fn. + estimate, complex. param.= 4 





Fig 4. Upper left panel: The function Q superimposed on 1000 data points generated from a 
model with regression function JT]!. The other five panels show the same function (solid lines) 
together with kernel estimates (dashed lines) with a variety of complexity parameter values 
and based on the data set shown in the upper left panel. (Upper left plot uses a different 
vertical scale than do the other plots.) 



3.1.2. Regression problem 

We compare QQMMrescue with BFGS on a simulated nonparametric regression 
problems. In each case, the "true" regression function is 



r(a;) := 0.5 + a; + 3a;^ + sin(5x), < x < 1 



(7) 



( 27|) . The graph of this function is shown in figure S) 

In each run, the simulated data consisted of 1000 data points. (While there 
is just one predictor variable, as explained in subsubsection 13.1.31 there are 
d :~ 1000 variables for optimization.) The predictors, xi, . . . , xiooo £^re uniformly 
distributed on (0,1). The errors are independent of the XiS and independent and 
identically distributed (i.i.d.) with t distribution with 7 degrees of freedom. Let 



Hi := r{xi) + erroTi, i = 1, 
(See figure[4]for a plot of one such data set.) 



, 1000. 
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3.1.3. Estimation 

Employ kernel-based L^/^-regression with Gaussian kernel 
K{x,x') := exp{--100|x- x'p} 

(0). So in this case 

1000 

F(h) = F(h;^,y) = J] |y. - h{x,)\'^' + A||/j||^. (8) 

1=1 

Here, \\h\\K is the norm in the RKHS, Hk, corresponding to K and A is the 
"complexity parameter." By the "kernel property" (fisl , p. 146), the infinite 
dimensional problem of minimizing F in ([8]) over Hk reduces to the finite di- 
mensional problem specified by taking 

1000 

h{x) := ^ hiK{x,Xr), 

i 

where /ii, . . . , /iiooo £ K and, using the symbol h to denote the vector 
{hi,..., ft-iooo), we have 

\\h\\l = hGh'. 

Here, G is the 1000 x 1000 "Gram matrix" (6, p. 169), whose («, j)*'' entry is 
K{xi, Xj). Thus, 

g(/i) = A||/i||^ = AhGh'^. (9) 

(I chose L^^^ regression instead of regression because kernel-based re- 
gression has a closed form solution (@, section 6.2.2) so does not require numer- 
ical optimization. Another standard choice would be kernel-based (absolute 
deviation) regression, but the criterion (V) is not everywhere differentiable 
and BFGS is designed for differentiable functions.) 

It turns out that for the simulated data sets described in subsubsection 13 . 1 ."21 
the Gram matrix, G, is often poorly conditioned. But to compute the minima 
of the qh^i),hi's (subsubsection [2". 2 . 2p QQMM requires the inverse of the matrix, 
AG, of Q. As a work around, for each randomly generated data set, in ^ the 
matrix G is replaced by G -I- 51, where I is the 1000 x 1000 identity matrix and 6 
equals 0.005 times the mean of the diagonal entries of G. (Of course, the mean 
of the diagonal entries equals the mean of the eigenvalues. In fact, the diagonal 
of G consists of I's, so each diagonal entry gets replaced by 1.005.) 

Figure [4] shows estimates of r for various values of A and one simulated data 
set. The estimates with A := 0.1 and 0.5 show "overfitting" . The estimates with 
A := 1 or 2 tracks the target, r, rather well. The estimate with A 4 exhibits 
"underfitting" . 

(In practice one might use cross-validation to assess the quality of estimates. 
This paper is concerned with the computational, not statistical, aspects of this 
estimation problem. In the course of model tuning one will normally fit models, 
and so perform optimization, with A too large and too small as well as near 
optimum A.) 
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3.2. Compare QQMM and BFGS in computing L^^^ -estimates 

For each value of A := 0.1, 0.5, 1, 2, and 4, 1 generated 1000 independent data sets 
as above. For each such data set I also generated an independent random vector, 
z, of 1000 i.i.d. standard normal variates. (z was generated independently for 
each of the 1000 data sets.) Then I minimized F using BFGS and QQMMrescwe 
for each of the 1000 data sets, in each case starting from the corresponding z. As 
noted above, we are interested in the mean number of evaluations used by each 
optimization method. Of course, an acceptable level of accuracy also required. 

I used BFGS with an iteration limit of 1,000,000 ("BFGSi,ooo,ooo") as the 
standard. (Do not be put off by the "1,000,000". Actually, BFGSi, 000,000 never 
needed more than 250 evaluations.) I compared BFGSi^ooo.ooo to QQMMrescue 
with 

• Stopping threshold e 0.01. 

• At most K, max.evals :— 110 QQMM evaluations. 

• Followed by an at most max.iter := 35 iteration BFGS rescue, if needed. 
3.2.1. Choosing the tuning constants 

First, I describe how the parameters max.evals and max.iter were arrived at. 
By varying two parameters one can hope to achieve two goals. The first goal 
was comparability between QQMMrescue and BFGS. The optimizers BFGS and 
QQMM can be tuned in many different ways. In order to make the comparison 
meaningful we need to use comparable versions. In the simulations max.evals 
and max.iter were chosen so that for A :— 0.1, 0.5, and 1, the maximum num- 
ber of evaluations QQMMrescue needed was never smaller than the maximum 
number needed by BFGSi, 000,000- The idea is to rule out the possibility that a 
lower mean number of evaluations for QQMM^esctie compared to BFGSi_ooo,ooo 
would simply be due to QQMMrescue being allowed fewer evaluations than the 
number needed by BFGSi, 000,000- (Of course, the maximum number of evalua- 
tions is a rather unstable statistic. This makes the tuning rather approximate.) 
That parity broke down for A > 2, where QQMM needed far fewer evaluations. 
For A := 2, BFGS with iteration hmit of only 10 ("BFGSio") was also run on 
the same simulated problems fsubsubsection [?^3.ip . The second goal driving the 
choice of max.evals and max.iter was that the relative errors for QQMM^escue 
be of tolerable size in the simulations. 

The choice of stopping threshold e := 0.01 was a judgment that a relative 
error less than 1% would be acceptable for most statistical purposes. (As we will 
see in subsection l3.3l the actual relative errors were usually far smaller than 0.01.) 
The other internal tuning constants in QQA/IM, viz. daring fsubsubsection [2^2 . 2p 
and the tuning constants that define sufficient decrease, viz., a and S (subsection 
12. 3p were chosen as follows. A simulation study was carried out in which tens of 
thousands of random two-dimensional optimization problems (not shown) such 
as that shown in figure [1] were generated and optimized using QQMM (without 
rescue) with various choices of the tuning constants. A large regression model 
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of evaluation count vs. tuning constants was fitted to the results. The tuning 
constants daring, a, and 6 were chosen to approximately minimize the fitted 
regression function. (Thus, the tuning constants were chosen based on problems 
different from the regression problems used for testing QQMM.) 

3.3. Results 

The results are summed up in figure [5] We see from panel (b) that for A := 
O.f, QQMMresctie required, on average, about 13% more evaluations than did 
BFGSi_ooo,ooo- Again with A :— O.f, QQMMrescue required rescue every time 
(panel (d)). As A increased the average number of evaluations required by 
QQMM decreased in absolute terms (panel (a)) and relative to the average 
number required by BFGSi, 000,000 (panel (b)) and and QQMM required "res- 
cue" less often (panel (d)). By the time A reached 2, QQMMrescue required only 
a little less than one third as many evaluations as did BFGSi^ooo.ooo and did 
not require "rescue" at all. 

Figure [5] (b) also shows what happened when no rescue was made. Then 
with A O.f, QQMM required 60% more evaluations than did BFGSi, 000,000- 
However, for A :— 0.5 or f , QQMM was actually a little faster without rescue 
than with. The additional evaluations needed by QQMMrescue for these A values 
can be thought of as "insurance premiums" to pay for a reduction in needed 
evaluations at A := 0.1. 

Define acceptable accuracy to mean relative error (compared to 
BFGSi^ooo.ooo) less than 1%. (Recall that the accuracy target, e ~ see subsec- 
tion [211]- had been set to 1%.) In that case, the accuracy of QQMMrescue was 
almost always acceptable. When A :— 0.1 the maximum relative error in the 
1,000 problems was 1.4%. The 95*'* percentile of relative error when A := 0.1 
was 1%. For A :— 0.5, 1, 2, or 4 the relative error was always less than 1%. In 
fact, figure [5] panel (c) shows that most of the time the error was substantially 
less than 1%. 

Interestingly, as remarked in subsection 11.31 QQMM itself is fairly computa- 
tionally intensive and so that for "cheap" V^s, like the L"^/^ criterion employed 
in this section, one might expect that QQMM to be slower in terms of actual 
running time than, say, BFGS. However, it turns out that for all five choices 
of A except A := O.f QQMMrescue actually had a shorter mean elapsed time 
than did BFGSi, 000,000- (Elapsed times were computed using the R function 
system. time, Becker et al ([ll, pp. 218-220).) This is particularly interesting 
in that optim must surely be a far more optimized program than is the still 
experimental QQMM. 

3.3.1. The A := 2 case 

With A 2, the maximum number of evaluations required by QQMMrescue was 
87. (In particular, QQMM never required "rescue" in this case.) The maximum 
required by BFGSi_ooo,ooo was 192. Thus, in this case the versions of the two 
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Fig 5. Summaries of some simulation results. In each case the variable on the horizontal 
axis is the complexity parameter, X, Panel (a) shows the mean absolute number of evalua- 
tions required by QQMMrescue (solid curve) and BFGSifioo,ooo (dashed curve; in the figure 
"BFGS" means "BFGSifioofiOo")- Panel (b) shows the ratio of the mean number of eval- 
uations of QQMMrescue to that of BFGSifioofiOO (solid curve) and QQMM (without res- 
cue) to BFGSifioofioo (dashed curve). Panel (c) shows the third quartiles of the errors of 
QQMMrescue relative to BFGSi fioo,ooo- Panel (d) shows the proportion of simulations for 
which QQMM needed "rescue". 
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Optimizer 


Avg. number 
of evaluations 


Max. number 
of evaluations 


Min. relative 
error 


Median rela- 
tive error 


QQMM 


52.9 


87.0 


5 X 10'* 


0.0014 


BFGSio 


69.1 


73.0 


0.12 


2.3 



Table 1 

Some statistics from simulation with X := 2. Note that here relative error is not expressed 

as a percent. 

optimizers were not comparable. So for A := 2, I also compared QQMM to 
"BFGSio" (BFGS restricted to a 10 iteration limit). BFGSio averaged slightly 
more evaluations (mean = 69) than did QQMM (mean — 53). (See Table [TJ) 
But the table shows that BFGSw was very inaccuarate, especially compared to 
QQMM. The smallest relative error for BFGSio (compared to BFGSi, 000,000) 
was 12%! The median of the ratio 

relative error for BFGSio 
relative error for QQMM 

was 1787 (!). (Here again "relative error" means error relative to BFGSi, 000,000-) 
4. Summary, Discussion, and Conclusions 

Good, general optimizers for convex functions of the form where y is a 
known non-negative convex function and Q is a known quadratic form, are ap- 
parently not readily available. QQMM is an apparently new optimizer designed 
for such problems. QQMM was designed to exploit the known structure of F to 
find, with a small number of function evaluations, an approximate minimizer of 
F. QQMM makes use of two means, early stopping and careful choice of trial 
minimizers, to reach this goal. 

A hybrid algorithm that employs "BFGS rescue" (subsection 12. 5p was com- 
pared to the well regarded BFGS algorithm, the latter with a generous iteration 
hmit. QQMM with BFGS rescue, denoted QQMMrescue, means the following. 
If QQMM does not converge after a limited number of evaluations, BFGS is 
run for a limited number of iterations from where QQMM left off. 

The comparison was based on fitting L^^^ nonparametric regression models 
to simulated data. The regression algorithm employed a "complexity parame- 
ter", A. To make the two optimizers comparable, I set the evaluation/iteration 
cutoffs for QQMMj-esctie so that, at least for small A, the maximum number of 
evaluations of the hybrid was at least as large as that of BFGS. For production 
work different evaluation limits on QQMM and iteration limits on BFGS may 
be preferable. Moreover, with today's multicore computers another way to com- 
bine QQMM and BFGS presents itself: Run both QQMM, without rescue, and 
BFGS simultaneously! When one converges, stop the other one. 

QQMM is an apparently new algorithm, so it was necessary to find some 
reasonable values for the tuning constants described in subsubsection 12.2.21 and 
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subsection 12. 31 This was done by systematically experimenting on tens of thou- 
sands of random two-dimensional optimization problems like the one illustrated 
in figure [TJ (In particular, the tuning constants were chosen based on prob- 
lems different from the regression problems that I used for testing in section [3l) 
Further work is needed in choosing good tuning constant values. 

I argued that the mean number of evaluations needed for convergence was 
the proper outcome measure for assessing the performance of an optimizer in 
the context of interest. The reason is that QQMM is designed especially for 
problems in which function evaluation is expensive. 

We found that, except for the lowest value of A tested, where QQMMresctte 
required 13% more evaluations, on average, than did BFGS, QQMMresctte re- 
quired substantially fewer evaluations, on average, than did BFGS. Moreover, 
wc found that in the example discussed in section [31 in which function evalua- 
tion was fast, usually QQMMrescue was faster than BFGS in terms of elapsed 
time as well as in number of function evaluations. I also tested QQMM with- 
out rescue, i.e., allowed to run until it converged. It was at least as fast as the 
rescued version except for the lowest value of A, where it required substantially 
more evaluations than did BFGS. 

Now, for the lowest value of A the algorithm over-fitted the data. In model 
fitting, one fits more models which seem to fit the data pretty well than one 
does fitting models that overfit. So based on our simulations one expects that 
in model fitting, a large majority of the time QQMM would out perform BFGS. 

Might BFGS be improved to work well with the sort of problems studied 
here? It probably could. However, QQMM could then also be further refined 
and an "arms race" between BFGS and QQMM might ensue. The purpose of 
this paper is not to report on such an arms race. Instead, my goal here is merely 
to introduce the QQMM algorithm and give preliminary test results concerning 
its performance. 

As remarked in subsubsection 12.2.21 and in subsubsection 13.1.31 QQMM re- 
quires the inverse of a matrix, Q, defining Q. In the calculations described in this 
paper I improved the conditioning of Q by adding a small constant to each di- 
agonal element of Q. This procedure actually changes the optimization problem 
slightly. From a practical, statistical standpoint the solution to the approximate 
problem may be just as good as the solution of the exact problem. 

However, it is worth investigating the possibility of a version of QQMM that 
does not require Q~^. In the meantime, one easy fix for this problem is to always 
follow up QQMM with a short BFGS run, even when QQMM does not require 
"rescue", starting from where QQMM left off but applying BFGS to the original 
problem with the unmodified Q. A similar strategy could also be applied when 
V is unbounded below and so, as suggested in subsection 11.21 is replaced by 
max{ V — c, 0} for some constant c. 

Another improvement to QQMMrescue might be possible. For the BFGS res- 
cue we started BFGS from where QQMM left off, presumably in some rough 
neighborhood of the true minimizer. However, one can imagine using the QQMM 
run to provide even more information to BFGS. BFGS uses a matrix, call it B, 
that functions like the inverse Hessian in Newton's method. BFGS computes 

imsart-ejs ver. 2008/01/24 file: ejs_2008_330.tex date: November 18, 2008 



S.P. Ellis/Convex Optimization 



22 



that matrix recursively as it evaluates F and its gradient. One could imagine 
having QQMM give BFGS a further head start by having QQMM compute an 
initial B for BFGS based on its (i.e., QQMM's) evaluation history. So far, my 
experiments with this procedure have not proved successful. However, I believe 
this idea is worth further work. 

I began this paper (subsection II. ip by mentioning my efforts to develop a 
nonparametric hazard function estimator. My impression was that QQMM was 
much faster than BFGS for hazard estimation. This nonparametric estimator is, 
at this writing, unpublished. However, if I may, I conclude this paper by describ- 
ing what happened the one and only time I used both BFGS and QQMM for the 
same hazard estimation problem. (This obviously has only anecdotal bearing on 
the question of the relative speeds of the two methods.) The comparison was 
made in estimating the hazard from a real data set. The kernel and complexity 
parameter for F as in ^ were such that the model fit well. (In particular, the 
estimator was tuned so that it did not overfit.) The number of variables in the 
optimization was d = 2508. QQMM (without rescue but tuned as in section [3]) 
took less than 30 minutes. BFGS took almost 5.3 hours. So BFGS took over 11 
times longer than QQMM. 

However, one might wonder how long BFGS would take had it been stopped 
more efficiently. Define "efficient stopping" to mean stopping as soon as a value 
of the objective function had been reached that was less than 1% more than 
the minimum value of the objective function, (e := 1% in the QQMM run.) 
I ran BFGS again using the best possible stopping rule, viz., one based on 
an "oracle" . (An "oracle" is a source of information not ordinarily available in 
practice.) The 5 hour BFGS run provided a good estimate of the minimum value 
of the objective. From that the value one can accurately compute the value 1% 
more than the minimum. Again, the 5 hour BFGS run revealed that BFGS first 
reached below the 101% threshold on the 180*'' iteration. So I ran BFGS again 
this time with an iteration limit of 180. With this stopping rule BFGS ran for a 
little over 2 hours before stopping. Thus, in this particular "real world" instance, 
QQMM without an oracle ran almost 4.5 times faster than BFGS did with an 
oracle. (QQMM computes an upper bound on the relative error, but without an 
oracle it cannot compute the true relative error.) 

This anecdote and the experiment with early stopping of BFGS in the L"^/^ 
estimation problem with A := 2 fsubsubsection [5".3.ip suggest that the reason 
that QQMM is ordinarily faster than BFGS docs not merely lie in the fact that 
QQMM allows good control of stopping (so stopping early is possible). QQMM 
generally makes the objective function decrease faster than does BFGS. (Of 
course, QQMM is specially designed for problems of the form ([2]) with convex 
V bounded below by a known value and known Q. BFGS can handle a much 
broader range of problems. ) 

Based on the numerical experiments described in section [3l experiments with 
tens of thousands of random simulation problems like that shown in figure [TJ 
and applications of QQMM to survival analysis mentioned in subsection 11.11 and 
above, QQMM seems to meet its goals rather well. 
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